Audio description from image by modal translation network

Abstract

Audio is the main way for visually impaired people to obtain information. In reality, visual data of all kinds is always present, but audio is missing in many cases. In order to help people better perceive the information around them, this paper proposes an image-to-audio-description (I2AD) task: generating audio descriptions from images. To complete this totally new task, a modal translation network (MT-Net) from the visual to the auditory sense is proposed. The MT-Net includes three progressive sub-networks: 1) feature learning, 2) cross-modal mapping, and 3) audio generation. First, the feature learning sub-network aims to learn semantic features from image and audio, including image feature learning and audio feature learning. Second, the cross-modal mapping sub-network transforms the image feature into a cross-modal representation with the same semantic concept as the audio feature. In this way, the correlation of inter-modal data is effectively mined, easing the heterogeneous gap between image and audio. Finally, the audio generation sub-network is designed to generate the audio waveform from the cross-modal representation. The generated waveform is interpolated to obtain the corresponding audio file according to the sample frequency. This being the first attempt to explore the I2AD task, large-scale datasets with plenty of manual audio descriptions are built. Experiments on the datasets verify the feasibility of generating intelligible audio directly from an image and the effectiveness of the proposed method.
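The final step in the abstract, interpolating the generated waveform to match the sample frequency and saving it as an audio file, can be illustrated with a minimal sketch. The abstract does not say which interpolation scheme or sample frequency the authors use, so linear interpolation, the 16 kHz target rate, the synthetic 440 Hz input, and the helper names `interpolate_waveform` and `write_wav` are all assumptions made for illustration only.

```python
import io
import math
import struct
import wave


def interpolate_waveform(samples, target_len):
    """Linearly interpolate `samples` (floats in [-1, 1]) to `target_len` points.

    Assumption: linear interpolation; the paper may use a different scheme.
    """
    if target_len <= 0 or not samples:
        return []
    if len(samples) == 1 or target_len == 1:
        return [samples[0]] * target_len
    step = (len(samples) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * step
        lo = min(int(pos), len(samples) - 2)  # left neighbour index
        frac = pos - lo                       # distance from left neighbour
        out.append(samples[lo] * (1.0 - frac) + samples[lo + 1] * frac)
    return out


def write_wav(samples, sample_rate, fileobj):
    """Write mono 16-bit PCM audio to a path or a binary file-like object."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(fileobj, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(frames)


# Hypothetical generator output: one second of a 440 Hz tone on a coarse 4 kHz grid.
coarse_rate, target_rate = 4000, 16000
coarse = [math.sin(2 * math.pi * 440 * n / coarse_rate) for n in range(coarse_rate)]

# Stretch to one second at the (assumed) 16 kHz sample frequency.
fine = interpolate_waveform(coarse, target_rate)

# Write to an in-memory buffer; pass a file path instead to save to disk.
buffer = io.BytesIO()
write_wav(fine, target_rate, buffer)
```

Writing to an in-memory buffer keeps the sketch free of side effects; in practice the network output would replace the synthetic tone.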

Similar resources

Grammatical adjustments in translation from English into Persian by Iranian students

To determine the relationship between certain English grammatical structures and the difficulty these structures pose in translation, the following research questions and null hypotheses were formulated: 1 - Do Iranian students have problems with grammatical adjustment when translating from English into Persian? 2 - Is there a relationship between English grammatical forms and the difficulty of these forms in translation from English into Persian? Hypothesis 1: Students ...

Evaluating Visual Information Provided by Audio Description

From Image Annotation to Image Description

In this paper, we address the problem of automatically generating a description of an image from its annotation. Previous approaches either use computer vision techniques to first determine the labels or exploit available descriptions of the training images to either transfer or compose a new description for the test image. However, none of them report results on the effect of incorrect label d...

Cross-modal Visual-audio Priming

This study assessed whether presenting visual-only stimuli prior to auditory stimuli facilitates the recognition of spoken words in noise. The results of the study indicate that this type of cross-modal priming does occur. Future directions for research in this domain are presented.

Processing Multi-modal Primitives from Image Sequences

In this paper, we describe a new kind of image representation in terms of local multi–modal Primitives. Our local Primitives can be characterized by three properties: (1) They represent different aspects of the image in terms of multiple visual modalities. (2) They are adaptable according to context. (3) They provide a condensed representation of local image structure. These three properties ma...

Journal

Journal title: Neurocomputing

Year: 2021

ISSN: 0925-2312, 1872-8286

DOI: https://doi.org/10.1016/j.neucom.2020.10.053